Preparation and assumed knowledge

  • Have a look at the Lab and make sure you have all packages installed on your computer.
  • For the exercises in this lab, we encourage you will work in groups as it will expose you to different views, ideas and opinions. That being said, please make an effort to complete the lab by typing your own code and (most importantly) expressing your own conclusions and interpretations.

Aims

  • Processing high-dimensional transcriptomics data and experience handling large \(p\) small \(n\) data.
  • Perform differential expression analysis using moderated t-tests.
  • Submit your answer to exercise 3.1 for formative feedback.
  • Build a ML model to predict patient rejection status from gene expression data and estimate the model’s accuracy.

Note: Any extension material won’t be assessed in the discipline quiz but may be of use for students who select this project as their interdisciplinary project.



1 Gene expression data

Kidney transplantation is often the treatment of choice for people with end-stage kidney disease. However, despite advances in the field of transplantation, the proportion of patients who develop graft rejection after a kidney transplant remains high. Understanding the characteristics between stable patients and patients who experience rejection can be one of the ways to improve health outcomes for these people.

In this exercise we will visualise public datasets on kidney transplant patients. These are RNA-sequencing datasets that measure the gene expression of patients’ cells. We will then use the gene expression to predict the patients’ outcome (i.e., stable versus rejection). The public datasets are all taken from the Gene Expression Omnibus database. You can download the dataset by looking up its GSE ID in the database. Information about each dataset, for example, the paper that this data is published in, the experimental protocol, can also be found on the database. We will first focus on how to analyse the data from “GSE46474”. This dataset contains the gene expression profiles of 40 blood samples. Of those, 20 patients developed graft rejection and 20 had stable grafts and will be treated as controls. Note, you can go to the package website for installation details.

library(GEOquery) 
library(R.utils)
library(reshape2)
library(ggplot2)
library(limma)
library(dplyr)
## gse <- getGEO("GSE46474")
##gse <- gse$GSE46474_series_matrix.txt.gz
load("data/GSE46474.RData")
gse
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 54613 features, 40 samples 
##   element names: exprs 
## protocolData: none
## phenoData
##   sampleNames: GSM1130812 GSM1130813 ... GSM1130851 (40 total)
##   varLabels: title geo_accession ... tissue:ch1 (46 total)
##   varMetadata: labelDescription
## featureData
##   featureNames: 1007_s_at 1053_at ... NA.25224 (54613 total)
##   fvarLabels: ID GB_ACC ... Gene Ontology Molecular Function (16 total)
##   fvarMetadata: Column Description labelDescription
## experimentData: use 'experimentData(object)'
##   pubMedIds: 25387159 
## Annotation: GPL570

1.1 Basic statistics

The actual expression matrix can be extracted using the exprs function. Data about the phenotype of each patient can be extracted using pData function. Similarly, we can extract the featureData object using the fData function.

head(pData(gse)[,1:5])
head(fData(gse)[,1:5])
head(exprs(gse)[,1:5])
eMat = exprs(gse)
gse$Outcome <- ifelse(grepl("AR", gse$title), "Rejection", "Stable") #Tidy the title variable and call it Outcome. 
table(gse$Outcome)
## 
## Rejection    Stable 
##        20        20
## head(pData(gse)[,1:5])
## head(fData(gse)[,1:5])
## head(exprs(gse)[,1:5])
eMat = exprs(gse)

(a) What is the number of samples in the Rejection class and the Stable class?

Suggestions:

gse$Outcome <- ifelse(grepl("AR", gse$title), "Rejection", "Stable") #Tidy the title variable and call it Outcome. 
table(gse$Outcome)
## 
## Rejection    Stable 
##        20        20

(b) What is the size of this data matrix?

Suggestions:

dim(gse)
## Features  Samples 
##    54613       40
dim(eMat)
## [1] 54613    40

It has 54613 rows and 40 columns, meaning for each of the 40 patients, there are gene expression measurements across 54613 genes.

(c) [Exploration - optional] The names of these features aren’t “genes” but the ID generated by the biotechnology company that created this platform. In this case, the name of the manufacture is call Affymetrix and the name of their platform is known as Affymetrix Human Genome U133 Plus 2.0 Array. This platform aims to provide the complete coverage of the Human Genome for analysis. Details of the platform can be found at the company’s website. Here, the information gene symbols are stored in the column Gene Symbol and the description of the genes are stored in the column named Gene Title.

rownames(gse)[1:10]
##  [1] "1007_s_at" "1053_at"   "117_at"    "121_at"    "1255_g_at" "1294_at"  
##  [7] "1316_at"   "1320_at"   "1405_i_at" "1431_at"
print(rownames(eMat[1:50, ]))
##  [1] "1007_s_at"    "1053_at"      "117_at"       "121_at"       "1255_g_at"   
##  [6] "1294_at"      "1316_at"      "1320_at"      "1405_i_at"    "1431_at"     
## [11] "1438_at"      "1487_at"      "1494_f_at"    "1552256_a_at" "1552257_a_at"
## [16] "1552258_at"   "1552261_at"   "1552263_at"   "1552264_a_at" "1552266_at"  
## [21] "1552269_at"   "1552271_at"   "1552272_a_at" "1552274_at"   "1552275_s_at"
## [26] "1552276_a_at" "1552277_a_at" "1552278_a_at" "1552279_a_at" "1552280_at"  
## [31] "1552281_at"   "1552283_s_at" "1552286_at"   "1552287_s_at" "1552288_at"  
## [36] "1552289_a_at" "1552291_at"   "1552293_at"   "1552295_a_at" "1552296_at"  
## [41] "1552299_at"   "1552301_a_at" "1552302_at"   "1552303_a_at" "1552304_at"  
## [46] "1552306_at"   "1552307_a_at" "1552309_a_at" "1552310_at"   "1552311_a_at"
featureData <- fData(gse) # GSE contain a featuredata slot that contains information about the genes that were sampled. 
colnames(featureData)
##  [1] "ID"                               "GB_ACC"                          
##  [3] "SPOT_ID"                          "Species Scientific Name"         
##  [5] "Annotation Date"                  "Sequence Type"                   
##  [7] "Sequence Source"                  "Target Description"              
##  [9] "Representative Public ID"         "Gene Title"                      
## [11] "Gene Symbol"                      "ENTREZ_GENE_ID"                  
## [13] "RefSeq Transcript ID"             "Gene Ontology Biological Process"
## [15] "Gene Ontology Cellular Component" "Gene Ontology Molecular Function"
head(featureData[,"Gene Symbol"])
## [1] "DDR1 /// MIR4640" "RFC2"             "HSPA6"            "PAX8"            
## [5] "GUCA1A"           "MIR5193 /// UBA7"

1.2 Data transformation

(a) Log transformation is a standard normalisation technique for RNA-seq data. Has our dataset has been log2 transformed? How can we determine this from looking at the data?

Suggestions:

summary(eMat[,1])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.683   3.357   4.564   4.758   5.834  14.988

(b) Save a copy of this processed dataset for future usage using write.csv.

write.csv(exprs(gse), "data/GSE46474_expression_matrix.txt")

1.3 Visualisation

It is critical that we examine the dataset. Create a series of boxplots to compare the sample distribution among different people. Are there any samples that appear to be significantly different from the others?  Is it necessary to remove any samples (patients)?

[Discussion] Can you think of any other graphical techniques to visualise or analyse the data?

Suggestions:

library(reshape2)
## base R code is
## boxplot(eMat) or boxplot(melt(exprs(gse)))

p <- ggplot(melt(exprs(gse)), aes(x=Var2, y=value)) +  
  geom_boxplot(outlier.colour="black", outlier.shape=16,outlier.size=0.5, notch=FALSE) +
  theme(axis.text.x = element_text(angle = 180, hjust = 1)) +
  labs (x = "patient", y = "expression value") + theme_minimal()
p

summary(melt(exprs(gse))$value)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.677   3.279   4.488   4.755   5.863  15.045

2 Association analysis: Finding signals in the data

2.1 t-test within one gene

Let us select row 8251 and have a look. This represents the gene expression from a selected gene. Calculate the two sample t-statistic which could be used to test the null hypothesis, that the mean log(expression) for rejection individuals is the same as that for stable individuals. Assume that this t-statistic does indeed have a t-distribution; obtain the p-value for testing this hypothesis. Check for normality of the data using a qqplot.

Suggestions:

eMat[8251,]
## GSM1130812 GSM1130813 GSM1130814 GSM1130815 GSM1130816 GSM1130817 GSM1130818 
##   8.634013   8.588576   9.420062   8.333498   9.635923   8.039824   8.824033 
## GSM1130819 GSM1130820 GSM1130821 GSM1130822 GSM1130823 GSM1130824 GSM1130825 
##   8.448080   8.578168   7.771895   8.159828   8.693350   8.835891   8.883193 
## GSM1130826 GSM1130827 GSM1130828 GSM1130829 GSM1130830 GSM1130831 GSM1130832 
##   9.010599   6.970322  10.048063   8.561478   8.824975   6.111392   9.425899 
## GSM1130833 GSM1130834 GSM1130835 GSM1130836 GSM1130837 GSM1130838 GSM1130839 
##   7.175468   8.835151   7.914015  10.413707   8.845688   8.146591   7.256471 
## GSM1130840 GSM1130841 GSM1130842 GSM1130843 GSM1130844 GSM1130845 GSM1130846 
##   8.467151   7.014702   9.046717   7.729293   8.965878   8.498529   9.658861 
## GSM1130847 GSM1130848 GSM1130849 GSM1130850 GSM1130851 
##   8.868084   8.849542   7.595218   8.984595   7.941705
boxplot(eMat[8251,] ~ gse$Outcome, ylab="log2 expression")

t.test(eMat[8251,] ~ gse$Outcome)
## 
##  Welch Two Sample t-test
## 
## data:  eMat[8251, ] by gse$Outcome
## t = 5.0164, df = 35.523, p-value = 1.474e-05
## alternative hypothesis: true difference in means between group Rejection and group Stable is not equal to 0
## 95 percent confidence interval:
##  0.6409244 1.5115623
## sample estimates:
## mean in group Rejection    mean in group Stable 
##                9.038282                7.962039

2.2 DE genes

What are the highly differentially expressed genes between stable and rejection patients? This is the same question as which genes have the most different expression between stable and rejection patients. We can perform a series of t-tests, and the bioinformatics community has developed a package that speeds up this process. We will examine a package called limma which fits a series of linear models \[ Y = X\beta\] where \(X\) is known as the design matrix. The theory and methods behind all this are covered in STAT3022 - Applied linear models.

design <- model.matrix(~Outcome, data = pData(gse))
fit <- lmFit(exprs(gse), design)
fit <- eBayes(fit)

## Without gene symbols
library(DT)
tT <- topTable(fit, n = Inf) 
DT::datatable(round(tT[1:100, ], 2))
## Adding gene symbols
tT <- topTable(fit, genelist=featureData[,"Gene Symbol"], n = 20)

2.3 Multiple testing

Explore: Run the following code and explore the impact of multiple testing. Convince yourself via simulation.

cl <- factor(sample(c("YES", "NO"), 80, replace=TRUE))
fakeX <- matrix(rnorm(10000*80), nrow=10000)
design <- model.matrix(~ cl + 0 )
fakefit <- lmFit(fakeX, design)
cont.matrix <- makeContrasts(clYES - clNO, levels=design)
fakefit2 <- contrasts.fit(fakefit, cont.matrix)
fakefit2 <- eBayes(fakefit2)

Suggestions:

cl <- factor(sample(c("YES", "NO"), 80, replace=TRUE))
fakeX <- matrix(rnorm(10000*80), nrow=10000)
design <- model.matrix(~ cl + 0 )
fakefit <- lmFit(fakeX, design)
cont.matrix <- makeContrasts(clYES - clNO, levels=design)
fakefit2 <- contrasts.fit(fakefit, cont.matrix)
fakefit2 <- eBayes(fakefit2)
DT::datatable(round(topTable(fakefit2), 2))
qqnorm(fakefit2$t)
abline(0,1)

2.4 Visualising results

We will examine our analytical results using two types of scatter plots

  • MA-plot: logFC (y-axis) vs AveExpr (x-axis) where M stands for Minus and A stands for Add.
  • Volcano plot: -log(P.Value) (y-axis) vs logFC (x-axis).
  • [Extension] Consider making these plots interactive so that you can locate or identify the genes of interest.
## Code for MA plot
df<- topTable(fit, number=nrow(fit), genelist=rownames(gse))
ggplot(df, aes(x = AveExpr, y = logFC))+
    geom_point(aes(colour=-log10(P.Value)), alpha=1/3, size=1) +
    scale_colour_gradient(low="blue",high="red")+
    ylab("log2 fold change") + xlab("Average expression")
    
## Code to extract negative log p-value
p.value = -log10(df$P.Value))

Suggestions:

df<- topTable(fit, number=nrow(fit), genelist=rownames(gse))

p <- ggplot(df, aes(x = AveExpr, y = logFC))+
    geom_point(aes(colour=-log10(P.Value)), alpha=1/3, size=1) +
    scale_colour_gradient(low="blue",high="red")+
    ylab("log2 fold change") + xlab("Average expression")
p

p <- ggplot(df, aes(logFC,-log10(P.Value)))+
    geom_point(aes(colour=-log10(P.Value)), alpha=1/3, size=1) +
    scale_colour_gradient(low="blue",high="red")+
    xlab("log2 fold change") + ylab("-log10 p-value")
p

3 Building predictive model

3.1 Exploring with PCA

It would be nice if we can visualise our data with a 2D scatter-plot. However, for each patient, there are more than 10000 variables. How are we going to visualise this information for each patient? Similar to Lab 1, we can use a dimension reduction technique called PCA. PCA helps to reduce or summarise the information of a high dimensional object into a lower dimension. We can then use the first and second dimension, as the summary information of the >10000 genes, to plot a 2D scatter-plot. The PCA plot frequently gives us a good indication of how simple or difficult the relevant prediction task is. Please submit this plot for formative feedback and comment (1 sentence only) on what you learn from looking at this graph. Take care to make sure the graphic is properly labelled.

Suggestions:

gse_pca <- prcomp(t(exprs(gse)))
df_toplot <- data.frame(gse$Outcome, 
                        pc1 = gse_pca$x[,1], pc2 = gse_pca$x[,2]  )
g <- ggplot(df_toplot, aes(x = pc1, y = pc2, color = gse.Outcome)) + 
  geom_point(size = 4) + 
  theme_minimal() 
g

From the scatter plot, each patient is represented by a data point, coloured by their rejection status. We can see that the rejection patients cannot be easily distinguished from stable patients. You might consider some filtering strategies before PCA.

3.2 Prediction model

Let’s apply the machine learning models we have learnt in the previous week to predict patient rejection status from gene expression data. One of the key challenges of building a classification model for biomedical data is facing the large p small n challenge. This means the number of variables (\(p\)) is often a lot more than the number of samples (\(n\)) which creates a lot of challenges, we address this in a ad hoc way by selecting a subset of genes (features). The following code will help you get started.

## quick filter to reduce computational time
largevar = apply(gse, 1, var)
ind = which(largevar > quantile(largevar, 0.95))

X = as.matrix(t(exprs(gse[ind,])))
y = gse$Outcome

cvK = 5  # number of CV folds
cv_50acc5_knn = cv_50acc5_svm = cv_50acc5_rf = c()
cv_acc_knn = cv_acc_svm = cv_acc_rf = c()

n_sim = 25 ## number of repeats

for (i in 1:n_sim) {
  cvSets =  <write your own code for KNN>  # permute all the data, into 5 folds
  cv_acc_knn = cv_acc_svm = cv_acc_rf = c()
  
  for (j in 1:cvK) {
    test_id = cvSets$subsets[cvSets$which == j]
    X_test = X[test_id, ]
    X_train = X[-test_id, ]
    y_test = y[test_id]
    y_train = y[-test_id]
    
    ## KNN
    fit5 = <write your own code for KNN>
    cv_acc_knn[j] = <write your own code for KNN>
    
    ## SVM
    svm_res <- e1071::svm(x = X_train, y = as.factor(y_train))
    fit <- predict(svm_res, X_test)
    cv_acc_svm[j] = <write your own code for KNN>

    ## RandomForest
    rf_res <- randomForest::randomForest(x = X_train, y = as.factor(y_train))
    fit <- predict(rf_res, X_test)
    cv_acc_rf[j] = <write your own code for KNN>
  }
  cv_50acc5_knn <- append(cv_50acc5_knn, mean(cv_acc_knn))
  cv_50acc5_svm <- <write your own code for KNN>
  cv_50acc5_rf <- <write your own code for KNN>
} ## end for

Suggestions:

## quick filter to reduce computational time

largevar = apply(gse, 1, var)
ind = which(largevar > quantile(largevar, 0.9))

X = as.matrix(t(exprs(gse[ind,])))
y = gse$Outcome

cvK = 5  # number of CV folds
cv_50acc5_knn = cv_50acc5_svm = cv_50acc5_rf = c()
cv_acc_knn = cv_acc_svm = cv_acc_rf = c()

n_sim = 25 ## number of repeats
for (i in 1:n_sim) {

  cvSets = cvTools::cvFolds(nrow(X), cvK)  # permute all the data, into 5 folds
  cv_acc_knn = cv_acc_svm = cv_acc_rf = c()
  
  for (j in 1:cvK) {
    test_id = cvSets$subsets[cvSets$which == j]
    X_test = X[test_id, ]
    X_train = X[-test_id, ]
    y_test = y[test_id]
    y_train = y[-test_id]
    
    ## KNN
    fit5 = class::knn(train = X_train, test = X_test, cl = y_train, k = 5)
    cv_acc_knn[j] = mean(fit5 == y_test)
    
    ## SVM
    svm_res <- e1071::svm(x = X_train, y = as.factor(y_train))
    fit <- predict(svm_res, X_test)
    cv_acc_svm[j] = mean(fit == y_test)

    ## RandomForest
    rf_res <- randomForest::randomForest(x = X_train, y = as.factor(y_train))
    fit <- predict(rf_res, X_test)
    cv_acc_rf[j] = mean(fit == y_test)
  }
  cv_50acc5_knn <- append(cv_50acc5_knn, mean(cv_acc_knn))
  cv_50acc5_svm <- append(cv_50acc5_svm, mean(cv_acc_svm))
  cv_50acc5_rf <- append(cv_50acc5_rf, mean(cv_acc_rf))
} ## end for

Visualise the comparison

boxplot(list(SVM = cv_50acc5_svm, KNN = cv_50acc5_knn , RF= cv_50acc5_rf ), ylab="CV Accuracy")

3.3 Assignment 1

One of the assignment 1 questions will be based on the following task.

Blood vs Biopsy Biomarker.

We estimated the accuracy for our predictive model in graft rejection from a peripheral blood gene expression dataset. However, rejection is a very active process that occurs in the kidney itself. Here we will look at a similar kidney microarray dataset. But instead of genes being isolated and sequenced from blood, they have been sequenced from a kidney biopsy. How do these differ when compared to the peripheral blood models we generated earlier? Can you make a informative plot?

[Optional] What reasons can you think of that account for this difference in model accuracy?

The code below illustrate how to download another dataset. We will also make GSE138043.RData available.

gse <- getGEO("GSE138043")
gse <- gse$GSE138043_series_matrix.txt.gz
gse <- load("GSE138043.RData")

library(preprocessCore)
exprs(gse) <- normalize.quantiles(exprs(gse)) # Normalise Data
boxplot(normalize.quantiles(exprs(gse)))

4 [Extension] Prepartion for Week 2 lecture

This section is optional as a head-start to the next case study which looks at data used to detect eye movement. A sequence of data from the Spiker box was created by two physics tutors. These signals are recorded and saved as WAV files. We’ll start our investigation by looking at the data in the file LRL L3.wav.

Data loading: With the code below, we’ll read the .wav file using the tuneR package’s readWave function. We use two functions unique to the S4 class to retrieve information:
- The function slotNames returns information about the components (termed slots) for a given object.
- To extract a specific slots of an S4 object, you use @ such as waveSeq@left. If you’re curious, you may learn all about the S4 object framework here, however it’s not needed for the course.

Explanations of the slotNames:
- left: Recordings (as sample point) in the left channel.
- right: Recordings (as sample point) in the right channel.
- stereo: Whether the sounds are recorded in stereo.
- samp.rate: The number of samples per second. If the sampling rate is 4000 hertz, a recording with a duration of 5 seconds will contain 20,000 samples.
- bit: the number of bits in each sample, and can be considered as a form of resolution.
- pcm: stands for Pulse Code Modulation, is a method to capture audio waveforms digitally.

## install.packages(tuneR)
library(tidyverse)
library(tuneR)
library(ggplot2)

waveSeq <- readWave("data/LRL_L3.wav")
waveSeq
## 
## Wave Object
##  Number of Samples:      72688
##  Duration (seconds):     7.27
##  Samplingrate (Hertz):   10000
##  Channels (Mono/Stereo): Mono
##  PCM (integer format):   TRUE
##  Bit (8/16/24/32/64):    16
slotNames(waveSeq)
## [1] "left"      "right"     "stereo"    "samp.rate" "bit"       "pcm"
# time (in second) of the sequencing
timeSeq <- seq_len(length(waveSeq))/waveSeq@samp.rate 

Quick visualization: From the code above, we see that there are 72688 values and we have 10000 observations per second. The code below provide a simple way visualize the data using the R base function. The three eye movements in this file are displayed in the title of the file “LRL” denoting a left eye movement followed by a right and left eye movement. Visually can you distinguish the pattern from left eye movement versus right eye movement?

plot(timeSeq, waveSeq@left, type = "l", ylab="Signal", xlab="Time(seconds)")

ggPlot visualization: In this data, we denote any eye movement as an event, so across the \(x\)-axis, we have values identified as part of a left/right eye movement (event) or no eye movement (no event). From the figure, we observed that the three events (left, right, left) that occur at event times are 1.8, 3.5 and 5 seconds, respectively. The following code used ggplot to graph the data and highlight an interval of 1 second over the three events. Take a look to see if you can understand the R code, and we will go over it further in the Week 2 lecture and Week 3 lab.

## Define the table by eye
eventTable <- data.frame(eventTime = c(1.8, 3.5, 5), 
                         eventType= c("Left", "Right", "Left"))

## Construct a data frame (df)
df <- data.frame(Y = waveSeq@left, 
                 time = timeSeq,
                 event_type = "none",
                 event_time = NA,
                 event_pos = NA)

## Check class type 
sapply(df, class)
##           Y        time  event_type  event_time   event_pos 
##   "integer"   "numeric" "character"   "logical"   "logical"
df$event_type = as.character(df$event_type)

## Create highlights
for (i in 1:nrow(eventTable)) {
  t_idx <- (df$time < (eventTable[i, 1] + 0.5)) & (df$time > (eventTable[i, 1] - 0.5))
  df$event_type[t_idx] <- as.character(eventTable[i, 2])
  df$event_time[t_idx] <- seq_len(sum(t_idx))
  df$event_pos[t_idx] <- i
}

df$event_type <- factor(df$event_type, levels = c("Right", "Left", "none"))
ggplot(df, aes(x = time, y = Y, col = event_type, group = 1)) +  geom_line() +
  scale_color_manual(values = c("#E41A1C", "#377EB8", "black")) +
  theme_bw()